LSTM-based Khmer Text Style Transfer Using Representation Learning

Pel Bunkhloem Phoeurn Kimhor
Pheth Soriyuon Phorn Sreypov
Phoeun Rajame Yin Sambat

Supervised by:
Dr. KHON Vanny Mr. TOUCH Sopheak

02/01/2026

1 Problem Statement

The task is formulated as a sequence-to-sequence style transfer problem, where an input sentence in normal Khmer is transformed into its corresponding royal-style sentence.

Let:

  • \(X = (x_1, x_2, \dots, x_T)\) denote the input sequence
  • \(Y = (y_1, y_2, \dots, y_{T'})\) denote the output sequence
  • \(f_\theta\) denote an encoder–decoder LSTM model

1.1 Model Objective and Notation

Conditional Probability

\[ P(Y \mid X; \theta) = \prod_{t=1}^{T'} P\!\left(y_t \mid y_1, \dots, y_{t-1}, X; \theta\right) \]

Training Objective

\[ \theta^* = \arg\max_{\theta} \sum_{(X,Y)\in\mathcal{D}} \log P(Y \mid X; \theta) \]

Notation

  • \(x_t, y_t\) : tokens at time step \(t\)
  • \(T, T'\) : input and output lengths
  • \(\theta\) : model parameters
  • \(h_t\) : encoder hidden state

2 Pre-training

2.1 Data Description

  • Sources: diverse online Khmer articles (news, blogs, social media, Wikipedia).
  • Scraping tool: Selenium.
  • Dataset size: 470,000 rows.

2.2 Data Preparation & Preprocessing

Cleaning Pipeline

  1. Remove Noise: special punctuation (។, ៗ).
  2. Script Focus: Strip Latin chars/digits.
  3. Normalization: Map redundant glyphs (e.g., ឝ → គ).
  4. Trim: Remove extra whitespace.
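The four cleaning steps above can be sketched as a small function; the exact character sets and regular expressions are illustrative assumptions, not the report's implementation:

```python
import re

def clean_khmer(text: str) -> str:
    """Cleaning sketch following the four pipeline steps (character choices are illustrative)."""
    # 1. Remove noise: Khmer punctuation such as ។ (khan) and ៗ (repetition sign).
    text = re.sub(r"[។ៗ]", "", text)
    # 2. Script focus: strip Latin letters and ASCII digits.
    text = re.sub(r"[A-Za-z0-9]", "", text)
    # 3. Normalization: map redundant glyph variants (e.g., the rare ឝ to គ).
    text = text.replace("ឝ", "គ")
    # 4. Trim: collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean_khmer("ABC ខ្ញុំ123។"))  # leaves only the Khmer word
```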

Dataset Splitting

  • Filter: discard sentences below a minimum character length.
  • Split: 80% Train | 10% Val | 10% Test.

2.3 Tokenization

  • Method: Character-to-index mapping (stoi).
  • Special Tokens: <sos>, <eos>, <pad>, <unk>.
  • Batching: Sequences aligned via pad_sequence.
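A minimal sketch of this tokenization scheme, assuming PyTorch's `pad_sequence` (the vocabulary and index assignments here are illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Character vocabulary with the special tokens listed above; indices are illustrative.
specials = ["<pad>", "<sos>", "<eos>", "<unk>"]
chars = sorted(set("កខគ"))  # stand-in for the characters seen in the training data
stoi = {tok: i for i, tok in enumerate(specials + chars)}

def encode(text: str) -> torch.Tensor:
    """Map a sentence to indices, wrapped in <sos>/<eos>; unseen characters map to <unk>."""
    ids = [stoi["<sos>"]] + [stoi.get(c, stoi["<unk>"]) for c in text] + [stoi["<eos>"]]
    return torch.tensor(ids)

batch = [encode("កខ"), encode("គ")]
# pad_sequence right-pads shorter sequences with the <pad> index so they batch together.
padded = pad_sequence(batch, batch_first=True, padding_value=stoi["<pad>"])
print(padded.shape)  # torch.Size([2, 4])
```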

2.4 Architecture Overview

2.4.1 LSTM Autoencoder Architecture

Diagram: LSTM Autoencoder Architecture

  • Limited parallel data: Only 793 normal ↔︎ royal sentence pairs
  • Language foundation: Model must understand Khmer structure before learning style
  • Unsupervised approach: Autoencoder learns from 30K general Khmer sentences

2.4.2 Workflow of LSTM Autoencoder

Diagram: Flow from Input through Encoder to Lower-Dimensional Space, then through Decoder back to Output
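The encoder-to-decoder flow in the diagram can be sketched in PyTorch as follows; the embedding and hidden dimensions are assumptions for illustration, not the report's actual hyperparameters:

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """Sketch of the autoencoder workflow; dimensions are illustrative."""
    def __init__(self, vocab_size=100, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        emb = self.embed(x)
        # Encoder compresses the sequence into its final (h, c) state:
        # the lower-dimensional representation in the diagram.
        _, state = self.encoder(emb)
        # Decoder reconstructs the sequence from that compressed state
        # (teacher forcing: the input characters are fed back in during training).
        dec_out, _ = self.decoder(emb, state)
        return self.out(dec_out)  # per-step logits over the character vocabulary

model = LSTMAutoencoder()
logits = model(torch.randint(0, 100, (2, 10)))  # batch of 2 sequences, length 10
print(logits.shape)  # torch.Size([2, 10, 100])
```

Training such a model on the 30K general sentences with a reconstruction loss gives the encoder a foundation in Khmer structure before any style supervision is introduced.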

3 Fine-tuning

3.1 Dataset: Normal to Royal Transfer

Parallel Corpus

  • Source: Scraped from various articles.
  • Volume: 793 sentence pairs.
  • Format:
    • Input: Normal Khmer (សាមញ្ញ)
    • Target: Royal Khmer (រាជស័ព្ទ)

Data Structure Example

  Normal: លោកបានដើរកាត់តាមឆ្នេរសមុទ្រ
  Royal: ព្រះអង្គស្តេចយាងកាត់តាមឆ្នេរសមុទ្រ

3.2 Preprocessing & Alignment

Text Cleaning

  • Punctuation: Removed traditional markers (៙, ៚, ៖, ។).
  • Normalization: whitespace trimming.

Tokenization

  • Level: Character-level mapping (stoi).
  • Wrapping: Added <sos> and <eos> to every sequence.
  • Splitting Strategy:
    • Train: 80%
    • Validation: 10%
    • Test: 10%

3.3 Attention Mechanism

The mechanism begins by calculating how well each encoder state \(h_s^{enc}\) matches the decoder's current needs.

Alignment Score: Measures the relevance of input \(s\) at decoding step \(t\): \[e_{t,s} = h_{t-1}^{dec} \cdot h_s^{enc}\]

Attention Weight: Normalizes scores into a probability distribution using Softmax: \[\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{k=1}^{T} \exp(e_{t,k})}\]

The model then extracts relevant information to generate the final character.

Context Vector (\(c_t\)): A weighted sum of all encoder hidden states: \[c_t = \sum_{s=1}^{T} \alpha_{t,s} h_s^{enc}\]

Final Prediction: The decoder hidden state \(h_t^{dec}\) is updated with \(c_t\), and the next character is predicted: \[P(y_t \mid y_{<t}, X) = \text{Softmax}(W_{hy} h_t^{dec} + b_y)\]
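The three attention equations above can be sketched directly in NumPy; the dimensions here (5 source positions, hidden size 8) are toy values for illustration:

```python
import numpy as np

def dot_attention(dec_h_prev, enc_states):
    """Dot-product attention following the equations above (toy dimensions)."""
    # Alignment scores: e_{t,s} = h_{t-1}^dec . h_s^enc
    scores = enc_states @ dec_h_prev
    # Softmax over source positions gives the attention weights alpha_{t,s}
    # (subtracting the max is a standard numerical-stability trick).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector c_t: weighted sum of all encoder hidden states.
    context = weights @ enc_states
    return weights, context

enc_states = np.random.randn(5, 8)   # T = 5 source positions, hidden size 8
dec_h_prev = np.random.randn(8)
weights, context = dot_attention(dec_h_prev, enc_states)
print(weights.sum(), context.shape)  # weights sum to 1; context has shape (8,)
```

The context vector would then be combined with the decoder state before the final softmax prediction, as in the last equation.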

4 Results

Pre-training BLEU (test):

Model           BLEU (%)
General Text    30.1
Folktale Text   9.4

Fine-tuning (samples):

Sample 1 (BLEU 0.79)
  Generated: ព្រះមហាក្សត្រប្រទានព្រះរាជបន្ទូលថា បាន
  Actual: ព្រះមហាក្សត្រប្រទានព្រះរាជទ្រព្យជួយរាស្

Sample 2 (BLEU 0.73)
  Generated: ព្រះនាងមិនឱ្យមានព្រះរាជបុត្រពីព្រះនាងមា
  Actual: ព្រះនាងមិនឱ្យភិលៀងធ្វើព្រះរាជកិច្ចជំនួស
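Scores like those above reflect modified n-gram precision between generated and reference text. A minimal character-level BLEU sketch; this is illustrative only, as the report's exact BLEU implementation and n-gram order are not specified:

```python
import math
from collections import Counter

def char_bleu(reference: str, candidate: str, max_n: int = 2) -> float:
    """Character-level BLEU sketch: modified n-gram precision with a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(reference[i:i+n] for i in range(len(reference) - n + 1))
        cand_ngrams = Counter(candidate[i:i+n] for i in range(len(candidate) - n + 1))
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(char_bleu("ព្រះអង្គយាង", "ព្រះអង្គយាង"))  # identical strings score 1.0
```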